
Repairing Spatial Data

When importing GIS data into SQL Server, several integrity constraints are enforced on the incoming data. These constraints may generate error messages that prevent the data from being imported. As part of Assetic's database configuration, polygon data imported into the system is subject to similarly rigorous standards, which block broken or ambiguous polygons that lack a coherent, clearly defined structure. Unfortunately, many WKT (Well Known Text) export extensions fail to meet these standards, producing polygon datasets that are incompatible with Assetic's database. As a workaround for 'broken' polygons, Assetic recommends running an automated repair tool such as Prepair, PostGIS 2.0's ST_MakeValid() function, or a similar utility.

Using the Prepair Tool

The Prepair tool can be downloaded from the following link:

https://github.com/tudelft3d/prepair/releases/download/v0.7.1/prepair-x64-0.7.1.zip

 

Download and extract the file into a folder of your choice. The tool can then be executed via the following steps:

 

Launch the Windows Command Prompt. This can be accessed by running 'cmd' from the Start menu.

 

Navigate to the folder into which the Prepair archive was extracted. This can be achieved by running the command cd <folder path> (without quotation marks), where <folder path> is the path to that folder.
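
For example, if the archive was extracted to C:\Tools\prepair-x64-0.7.1 (an illustrative path; substitute your own), the command would be:

cd C:\Tools\prepair-x64-0.7.1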

 

From here, running the command prepair with no arguments prints a quick overview of the tool's functionality.

 

The tool's full suite of functionality is beyond the scope of this document. This article focuses on the prepair --wkt option, which takes a Polygon input and performs shape correction, returning a corrected MultiPolygon output. It is run with the following command:

 

prepair --wkt "POLYGON", where POLYGON is the WKT polygon to be corrected.

Important: The polygon's WKT must be enclosed in double quotation marks.

A sample execution is provided below:
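
The following run uses the self-intersecting 'bowtie' polygon from the Prepair documentation; the exact formatting and vertex order of the MultiPolygon output may vary between versions:

prepair --wkt "POLYGON((0 0, 0 10, 10 0, 10 10, 0 0))"
MULTIPOLYGON (((0 0,0 10,5 5,0 0)),((5 5,10 10,10 0,5 5)))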

 

The MultiPolygon output is suitable for importing directly into Assetic's systems.

Handling Large Polygons

On occasion, the need to process exceptionally large and intricate polygons may arise. Such polygon data is usually characterised by a large number of coordinate points captured at a high degree of measurement precision, with each coordinate carrying many decimal places. The Windows command shell (cmd.exe) has an inherent input limit of 8,191 characters, and large polygons will quickly meet or exceed it. Since it is not always feasible or simple to truncate the data to be repaired, the following procedure is the recommended workaround.

 

The Prepair tool can also repair polygon data read from a text file, which can be used to work around this command-line limit.

 

Paste the polygon data into a text file and save it in the same folder as the Prepair executable.

 

Note that quotation marks are unnecessary when loading polygon data from a .txt file.

 

Next, run the following command:

 

prepair -f <TEXTFILE>, where <TEXTFILE> is the .txt file containing the target polygon data.
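
For example, assuming the polygon data was saved to a file named polygon.txt in the Prepair folder (the file name is arbitrary):

prepair -f polygon.txt

The repaired MultiPolygon is printed to the console, just as with the --wkt option.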

 

Please note that the above method repairs only one polygon per .txt file. Users should avoid storing multiple polygons on separate lines of the .txt file in an attempt to bulk repair many polygons at once.

Bulk Repair Polygons

To perform a bulk repair of multiple polygons at once, a scripted approach is recommended. A sample script utilizing Python and the subprocess module is available as a reference below:

NOTE  To learn more about the Python programming language, and for additional information on available Assetic Python packages, please see the Assetic Python SDK - Quick Start article.

This sample script parses an auto-generated error file (.csv) created by the Assetic Cloud Data Exchange if import errors occur during a spatial import. The repaired spatial data is appended to additional columns with the suffix "_Fixed" in the output file.

 

The sample script also assumes that the standard csv template for spatial imports from the Data Exchange Import Wizard (Mapping -> Asset Spatial) is being used. This is for the purposes of reading and parsing the error file data using the standard template column headers ("Asset ID", "Point", "Polygon", "Line").
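
To illustrate, if the error file contains both polygon and line errors, the input and output files would carry the following column headers (template columns as above; the _Fixed columns are added by the script):

Input:  Asset ID, Point, Polygon, Line
Output: Asset ID, Point, Polygon, Line, Polygon_Fixed, Line_Fixed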

 

If using the sample script, the file paths set in the block at the bottom of the script must be modified: the directory of the extracted Prepair tool (prepair_directory), the input file path (input_file, a .csv), and a nominated output file path (output_file, a .csv).

NOTE  If the output file is used for a subsequent Data Exchange import, care should be taken during the mapping steps of the Import Wizard to change the mapping to the new column containing the repaired spatial data.

 

from subprocess import Popen, PIPE
import pandas as pd
import os
import re

"""
Assetic.BulkPrepair.py
This script demonstrates how to repair broken spatial data in Well Known Text (WKT)
format using the Prepair tool.
The script parses the auto-generated error file (.csv) from the Assetic Cloud Data
Exchange import: Mapping -> Asset Spatial
Errors are filtered and repaired depending on the type of the WKT (i.e. Polygons,
Multi-Polygons and Linestrings).
- Polygons and Multi-Polygons are repaired using the Prepair utility
- Linestrings are repaired by removing consecutive duplicate Points
The repaired data is then appended as new columns when saved to the .csv output file.
Prepair download link: https://github.com/tudelft3d/prepair/releases
Full documentation: https://assetic.zendesk.com/hc/en-us/articles/360000516615
Author: Shien Jinn Sam (Assetic)
"""


def repair_linestring(broken_linestrings):
    # isolate points from each linestring in order to compare and remove consecutive duplicates
    fixed_linestrings = []
    for linestring in broken_linestrings:
        # determine whether this is a linestring or multilinestring from the WKT prefix
        wkt_prefix = str.split(linestring, "(")[0].strip()
        valid_prefix = ['LINESTRING', 'MULTILINESTRING']
        if wkt_prefix.upper() not in valid_prefix:
            raise ValueError(f"Invalid Line Prefix encountered: {wkt_prefix}.\n"
                             f"Valid Values: {valid_prefix}")
        if wkt_prefix.upper() == 'MULTILINESTRING':
            # remove the wkt prefix and strip the enclosing outer parentheses
            wkt_content = linestring[len(wkt_prefix):].strip()[1:-1]
            # split wkt on parentheses to get a list of each line that comprises the multiline
            wkt_multilines_list = re.findall(r"\((.*?)\)", wkt_content)
            # parse each line individually to repair
            fixed_sub_lines = list()
            for line in wkt_multilines_list:
                # parse the line and remove repeated points
                parsed_line = _del_repeated_linestring_pnts(line)
                # put the linestring back inside parentheses and append to the list
                fixed_sub_lines.append("(" + parsed_line + ")")
            # put all lines from the multiline back into a single WKT string
            fixed_multiline = "MULTILINESTRING(" + ", ".join(fixed_sub_lines) + ")"
            fixed_linestrings.append(fixed_multiline)
        elif wkt_prefix.upper() == 'LINESTRING':
            # remove the leading "LINESTRING" prefix and brackets, leaving comma-separated points
            wkt_content = linestring[linestring.find("(") + 1:]
            wkt_content = wkt_content[0:wkt_content.find(")")]
            parsed_line = _del_repeated_linestring_pnts(wkt_content)
            # reconstruct LINESTRING in valid WKT format
            fixed_linestrings.append("LINESTRING(" + parsed_line + ")")
    return fixed_linestrings


def _del_repeated_linestring_pnts(wkt_string: str):
    """
    Accepts a WKT Linestring as a str and removes any vertices (points)
    that do not adhere to duplicate vertex rules.
    To adhere to OGC standards, the Linestring is split into individual points
    and the points are compared for duplicates. A well-formed Linestring will
    not have consecutive duplicate vertices (points), and will also not have
    overlapping segments where two duplicate vertices have a single vertex
    between them.
    """
    # separate the wkt string into a list of points for comparison
    raw_points_list = [item.strip() for item in wkt_string.split(',')]
    # get the total number of points
    total_points = len(raw_points_list)
    # skip linestrings with only 2 points; deleting 1 would make it no longer a Linestring
    if total_points <= 2:
        return wkt_string
    # start from index 0 and iterate over the full points list to check for vertex OGC compliance
    i = 0
    while i < (total_points - 1):
        if total_points < 2:
            print("Invalid LINESTRING detected! Repaired linestring has fewer than 2 vertices.")
        # check immediate neighbour duplicates
        if raw_points_list[i] == raw_points_list[i + 1]:
            del raw_points_list[i + 1]
            total_points -= 1
            # also move the iterator back 1 to allow a regressive leapfrog check after the change
            if i > 0:
                i -= 1
            continue
        # check for a duplicate line segment: leapfrog 1 point and check if that point is a duplicate
        lf = i + 2
        if lf <= (total_points - 1):
            if raw_points_list[i] == raw_points_list[lf]:
                del raw_points_list[lf]
                total_points -= 1
                continue
        # iterate and continue checks
        i += 1
    # join the points back into a str and return
    rejoined_points = ", ".join(raw_points_list)
    return rejoined_points


def prepair_polygons(prepair_dir, broken_polygons):
    path = prepair_dir
    # initiate an empty list for storing the fixed polygons
    outputs = []
    for polygon in broken_polygons:
        try:
            # default to passing the polygon directly on the command line
            command = ['prepair', '--wkt', polygon]
            # collect output and error data as byte types
            process = Popen(command, cwd=path, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE)
            out, err = process.communicate()
        except FileNotFoundError as e:
            # The Windows CreateProcess() function has a 32768 character limit for args.
            # Handle this and treat it as the command line being too long.
            if len(polygon) >= 32768:
                out = None
                err = 'The command line is too long.'.encode(encoding='ascii')
            else:
                # unknown error not yet handled, rethrow
                raise e
        # If the polygon is too large, load it from a .txt file to overcome the Windows
        # command shell limitations. This is computationally more expensive, but for all
        # practical purposes it runs quickly. Only polygons over 8191 characters require it.
        if err.decode("ascii").strip() == 'The command line is too long.':
            # write the polygon to a .txt file to sidestep the shell's input character limit;
            # the file is overwritten on the next cycle of the for loop
            f = open(os.path.join(path, 'polygon.txt'), 'w')
            f.write(polygon)
            f.close()
            # specify the file-based command
            command = ['prepair.exe', '-f', 'polygon.txt']
            # initiate the child process and collect output and error data as byte types
            process = Popen(command, cwd=path, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE)
            out, err = process.communicate()
        # append decoded bytes, removing trailing whitespace
        outputs.append(out.decode("ascii").strip())
    return outputs


def main(prepair_dir, input_file_path, output_file_path):
    # read data from the .csv error file into a pandas dataframe
    errors = pd.read_csv(input_file_path, dtype=str)
    # create a list to hold the output dataframes
    fixed = list()
    # 1. Get Polygon/Multipolygon data and pass the 'Polygon' WKT column to Prepair
    broken_polygons = errors[~errors["Polygon"].isnull()].copy()
    polygons = broken_polygons["Polygon"]
    if not polygons.empty:
        fixed_polygons = prepair_polygons(prepair_dir, polygons)
        broken_polygons["Polygon_Fixed"] = fixed_polygons
        fixed.append(broken_polygons)
    # 2. Get Linestring data and remove consecutive duplicate Points (a common issue)
    broken_linestring = errors[~errors["Line"].isnull()].copy()
    linestrings = broken_linestring["Line"]
    if not linestrings.empty:
        fixed_linestrings = repair_linestring(linestrings)
        broken_linestring["Line_Fixed"] = fixed_linestrings
        fixed.append(broken_linestring)
    data_out = pd.DataFrame()
    # if both Polygons and Lines have been repaired, merge the fixed data together
    if len(fixed) > 1:
        data_out = pd.concat(fixed)
    elif len(fixed) == 1:
        data_out = fixed[0]
    # write to the output file
    if not data_out.empty:
        data_out.to_csv(output_file_path, index=False)
    else:
        print("Output pandas dataframe is empty. No data has been repaired.")


# This will run the method main() to start the repair process.
# Please update the file paths below for your prepair directory and input/output files.
if __name__ == "__main__":
    # set the prepair directory; download link: https://github.com/tudelft3d/prepair/releases
    prepair_directory = "D:/temp/prepair-x64-0.7.1"
    # set paths here for the input and output CSV files
    input_file = "C:/temp/broken_data.csv"
    output_file = "C:/temp/fixed_data.csv"
    # run the repair process
    main(prepair_dir=prepair_directory,
         input_file_path=input_file,
         output_file_path=output_file)
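
Once the paths have been updated, the script can be run from the command prompt. This assumes Python 3 with the pandas package installed, and that the script was saved as Assetic.BulkPrepair.py (per its docstring):

python Assetic.BulkPrepair.py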